library(dplyr)
library(ggplot2)
library(treemapify)
library(plotly)
library(tidyr)
library(tidytext)
library(SnowballC)
library(syuzhet)
library(furrr)
library(quarto)

Natural language processing (NLP) focuses on using computers to analyze and understand text, converting human language into a format that machines can process. NLP encompasses many techniques, such as text classification and topic modeling. For our project we'll be using sentiment analysis, which determines the opinion an author expresses in a piece of text; the text can be categorized as positive, negative, or neutral. We'll be analyzing comments from the New York Times for the year 2017. The data covers the months January through April. The datasets used for our analysis can be found on the Kaggle web site at the following link: New York Times Comments
Code used for our analysis is hidden and can be viewed by expanding the “Show Code” option
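As a quick illustration of the idea before we start, syuzhet's get_sentiment function (used later in this analysis) can score a single sentence with the AFINN method; the sentences below are made up for demonstration, not taken from the NYT data:

```r
library(syuzhet)

# Toy sentences, not from the dataset
get_sentiment("What a wonderful, insightful article", method = "afinn") # positive total
get_sentiment("This was a terrible, dishonest take", method = "afinn")  # negative total
```

A positive total indicates overall positive sentiment, a negative total overall negative sentiment.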
nrow(nyt_comments17)
[1] 969655
nyt_comments17 |> glimpse()
Rows: 969,655
Columns: 5
$ commentBody <chr> "This project makes me happy to be a 30+ year Times sub…
$ commentID <int> 22022598, 22017350, 22017334, 22015913, 22015466, 22012…
$ commentType <chr> "comment", "comment", "comment", "comment", "comment", …
$ newDesk <chr> "Insider", "Insider", "Insider", "Insider", "Insider", …
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News",…
We see from the above output that our dataset has just under 970,000 rows. We will not need the commentType column, so we will remove it.
nyt_comments17_short <- nyt_comments17 |>
  select(-commentType)

nyt_comments17_short |>
  glimpse()
Rows: 969,655
Columns: 4
$ commentBody <chr> "This project makes me happy to be a 30+ year Times sub…
$ commentID <int> 22022598, 22017350, 22017334, 22015913, 22015466, 22012…
$ newDesk <chr> "Insider", "Insider", "Insider", "Insider", "Insider", …
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News",…
Now that our columns are selected, we will prepare the data for analysis. First, a function will be created to clean the text in the commentBody column. This includes removing HTML/Markdown tags and backslashes, and replacing runs of whitespace with a single space.
clean_comment_text <- function(text) {
text %>%
gsub("<.*?>", " ", .) %>% # Remove HTML/Markdown tags like <br/>
gsub("\\\\", " ", .) %>% # Remove backslashes
    gsub("[^\\p{L}\\p{N}\\s!?\\p{Emoji_Presentation}]", " ", ., perl = TRUE) %>% # Keep letters, digits, whitespace, !, ?, and emoji
gsub("\\s+", " ", .) %>% # Replace multiple spaces with single space
trimws()
}

nyt_comments_clean <- nyt_comments17_short |>
  mutate(clean_comments = clean_comment_text(commentBody))

nyt_comments_clean |>
  glimpse()
Rows: 969,655
Columns: 5
$ commentBody <chr> "This project makes me happy to be a 30+ year Times sub…
$ commentID <int> 22022598, 22017350, 22017334, 22015913, 22015466, 22012…
$ newDesk <chr> "Insider", "Insider", "Insider", "Insider", "Insider", …
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News",…
$ clean_comments <chr> "This project makes me happy to be a 30 year Times subs…
nrow(nyt_comments_clean)
[1] 969655
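To see what those substitutions do, here is a small sketch applying the same steps to a made-up comment string (simplified to drop the emoji class; the text is hypothetical, not from the dataset):

```r
x <- "Great <br/> article!\\ Thanks   NYT"               # hypothetical raw comment

x <- gsub("<.*?>", " ", x)                               # strip HTML/Markdown tags
x <- gsub("\\\\", " ", x)                                # strip literal backslashes
x <- gsub("[^\\p{L}\\p{N}\\s!?]", " ", x, perl = TRUE)   # keep letters, digits, spaces, ! and ?
x <- gsub("\\s+", " ", x)                                # collapse runs of whitespace
trimws(x)                                                # "Great article! Thanks NYT"
```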
Because the dataset is relatively large at 969,655 rows, we will use parallel processing for efficiency and speed during sentiment extraction. To accomplish this we will use the future_map_dbl function from the furrr library, which distributes the mapping work across CPU cores in parallel. All but one CPU core will be used for our processing.
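As a minimal sketch of the furrr pattern, here is a toy computation (squaring numbers rather than scoring sentiment) mapped across two workers:

```r
library(furrr)

plan(multisession, workers = 2)      # spin up two background R sessions
res <- future_map_dbl(1:4, ~ .x^2)   # map in parallel, return a double vector
plan(sequential)                     # shut workers down when finished

res  # 1 4 9 16
```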
plan(multisession, workers = parallel::detectCores() - 1)

For our sentiment analysis we will need to choose a sentiment lexicon. A lexicon is essentially a sentiment dictionary that maps words to the sentiment they have been tagged with, such as negative or positive. We'll be using the AFINN dictionary, in which, instead of each word being tagged as negative or positive, words are given a numeric score: a negative score indicates negative sentiment, a positive score indicates positive sentiment, and a score of zero indicates neutral sentiment. The individual word scores range from -5 to 5. The compound score for a piece of text is then calculated by summing the scores (weights) of its sentiment words, so a comment's total can fall outside that range. From this calculated score we can assign a category to each row of text.
get_sentiments("afinn")
# A tibble: 2,477 × 2
word value
<chr> <dbl>
1 abandon -2
2 abandoned -2
3 abandons -2
4 abducted -2
5 abduction -2
6 abductions -2
7 abhor -3
8 abhorred -3
9 abhorrent -3
10 abhors -3
# ℹ 2,467 more rows
The first ten rows of the AFINN dictionary are shown above. The first column is the word and the second is the value column holding its sentiment score.
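To illustrate how the compound score is a sum of word weights, here is a small example on made-up sentences (in AFINN, "good" is scored +3 and "bad" -3):

```r
library(syuzhet)

get_sentiment("This is good",   method = "afinn")  # one positive word
get_sentiment("This is bad",    method = "afinn")  # one negative word
get_sentiment("good good good", method = "afinn")  # weights add up: three times the single-word score
```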
comments17_sentiment<-nyt_comments_clean |>
mutate(
sentiment_score=future_map_dbl(clean_comments, get_sentiment, method = "afinn")
)

comments17_sentiment |> glimpse()
Rows: 969,655
Columns: 6
$ commentBody <chr> "This project makes me happy to be a 30+ year Times su…
$ commentID <int> 22022598, 22017350, 22017334, 22015913, 22015466, 2201…
$ newDesk <chr> "Insider", "Insider", "Insider", "Insider", "Insider",…
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News"…
$ clean_comments <chr> "This project makes me happy to be a 30 year Times sub…
$ sentiment_score <dbl> 5, -2, 8, 0, 4, 5, 4, 13, 3, 0, 4, -3, -5, 0, 1, -5, 7…
comments17_sentiment |>
select(commentID, sentiment_score) |>
  head(30)
   commentID sentiment_score
1 22022598 5
2 22017350 -2
3 22017334 8
4 22015913 0
5 22015466 4
6 22012085 5
7 22003784 4
8 22024897 13
9 22082978 3
10 22004930 0
11 22005135 4
12 22004841 -3
13 22005149 -5
14 22004746 0
15 22005218 1
16 22005228 -5
17 22004632 7
18 22004617 5
19 22004589 -2
20 22004546 4
21 22004815 10
22 22004629 3
23 22005189 -2
24 22005185 5
25 22004932 0
26 22004805 0
27 22004566 -1
28 22004431 -5
29 22004413 1
30 22004294 0
Using the glimpse function we now find that there is a new column named “sentiment_score”. Viewing the first thirty rows we see that a score has been assigned to each row.
Now that we have sentiment scores, our next step is to assign a sentiment label based on them. Negative scores will be assigned a sentiment of "negative", positive scores a sentiment of "positive", and a score of zero a sentiment of "neutral". This will be accomplished using the case_when function from the dplyr library.
comments17_sentiment<-comments17_sentiment %>%
mutate(sentiment=case_when(
sentiment_score > 0 ~ "positive",
sentiment_score < 0 ~ "negative",
TRUE ~ "neutral"
  ))

comments17_sentiment |>
  glimpse()
Rows: 969,655
Columns: 7
$ commentBody <chr> "This project makes me happy to be a 30+ year Times su…
$ commentID <int> 22022598, 22017350, 22017334, 22015913, 22015466, 2201…
$ newDesk <chr> "Insider", "Insider", "Insider", "Insider", "Insider",…
$ typeOfMaterial <chr> "News", "News", "News", "News", "News", "News", "News"…
$ clean_comments <chr> "This project makes me happy to be a 30 year Times sub…
$ sentiment_score <dbl> 5, -2, 8, 0, 4, 5, 4, 13, 3, 0, 4, -3, -5, 0, 1, -5, 7…
$ sentiment <chr> "positive", "negative", "positive", "neutral", "positi…
comments17_sentiment |>
select(commentID, sentiment) |>
  head(30)
   commentID sentiment
1 22022598 positive
2 22017350 negative
3 22017334 positive
4 22015913 neutral
5 22015466 positive
6 22012085 positive
7 22003784 positive
8 22024897 positive
9 22082978 positive
10 22004930 neutral
11 22005135 positive
12 22004841 negative
13 22005149 negative
14 22004746 neutral
15 22005218 positive
16 22005228 negative
17 22004632 positive
18 22004617 positive
19 22004589 negative
20 22004546 positive
21 22004815 positive
22 22004629 positive
23 22005189 negative
24 22005185 positive
25 22004932 neutral
26 22004805 neutral
27 22004566 negative
28 22004431 negative
29 22004413 positive
30 22004294 neutral
Our data now has a "sentiment" column reflecting negative, positive, or neutral, based on the sentiment_score column.
Visualization can help us view the percent breakdown of sentiment and compare sentiment by type of material and news desk.
sentiment_tree<-comments17_sentiment |>
count(sentiment) |>
  mutate(perc = round(n/sum(n), 3)*100)

tree_map <- ggplot(sentiment_tree, aes(area = perc, fill = sentiment, label = perc)) +
geom_treemap()+
geom_treemap_text()+
ggtitle("Sentiment (Percent)")+
theme(plot.title = element_text(color="black", size=14, face="bold.italic", hjust=0.5))+
  scale_fill_discrete(name = "Sentiment")

From the treemap plot we find that positive and negative sentiments are virtually equal at 42.2% and 42.8%, respectively.
dot_cnt<-ggplot(comments17_sentiment, aes(x=sentiment,y=typeOfMaterial))+
  geom_count(aes(colour = sentiment))

gg_dot_cnt <- ggplotly(dot_cnt)

Sentiment by Type of Material
Editorial has the greatest difference between negative and positive sentiment.
tile_cnt<-comments17_sentiment |>
group_by(newDesk) |>
count(sentiment, newDesk) |>
  mutate(perc = round(n/sum(n), 2)*100)

g_tile <- tile_cnt |>
ggplot(aes(x=sentiment,y=newDesk))+
geom_tile(aes(fill=perc))+
labs(fill = "Percent")+
xlab("Sentiment")+
  theme(axis.title.y = element_blank())

gg_gtile <- ggplotly(g_tile)

Sentiment by News Desk